{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Topic Modeling: Latent Dirichlet Allocation with gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gensim is a specialized NLP library with a fast LDA implementation and many additional features. We will also use it in the next chapter on word vectors (see the notebook lda_with_gensim for details)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:26.731017Z",
     "start_time": "2020-06-20T19:41:26.728853Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:27.497153Z",
     "start_time": "2020-06-20T19:41:26.732690Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/scipy/sparse/sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!\n",
      "scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.\n",
      "  _deprecated()\n"
     ]
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "from pathlib import Path\n",
    "import pandas as pd\n",
    "\n",
    "# Visualization\n",
    "import seaborn as sns\n",
    "import pyLDAvis\n",
    "\n",
    "# sklearn for feature extraction & modeling\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.model_selection import train_test_split\n",
    "import joblib\n",
    "\n",
    "# gensim for alternative models\n",
    "from gensim.models import LdaModel\n",
    "from gensim.corpora import Dictionary\n",
    "from gensim.matutils import Sparse2Corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:27.501345Z",
     "start_time": "2020-06-20T19:41:27.498551Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.set_style('white')\n",
    "pyLDAvis.enable_notebook()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Load BBC data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:27.510164Z",
     "start_time": "2020-06-20T19:41:27.502413Z"
    }
   },
   "outputs": [],
   "source": [
    "# change to your data path if necessary\n",
    "DATA_DIR = Path('../data')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:27.580346Z",
     "start_time": "2020-06-20T19:41:27.511351Z"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "path = DATA_DIR / 'bbc'\n",
    "files = path.glob('**/*.txt')\n",
    "doc_list = []\n",
    "for i, file in enumerate(files):\n",
    "    with open(str(file), encoding='latin1') as f:\n",
    "        topic = file.parts[-2]\n",
    "        lines = f.readlines()\n",
    "        heading = lines[0].strip()\n",
    "        body = ' '.join([l.strip() for l in lines[1:]])\n",
    "        doc_list.append([topic.capitalize(), heading, body])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Convert to DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:27.587013Z",
     "start_time": "2020-06-20T19:41:27.581227Z"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 2225 entries, 0 to 2224\n",
      "Data columns (total 3 columns):\n",
      " #   Column   Non-Null Count  Dtype \n",
      "---  ------   --------------  ----- \n",
      " 0   topic    2225 non-null   object\n",
      " 1   heading  2225 non-null   object\n",
      " 2   article  2225 non-null   object\n",
      "dtypes: object(3)\n",
      "memory usage: 52.3+ KB\n"
     ]
    }
   ],
   "source": [
    "docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'article'])\n",
    "docs.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Create Train & Test Sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:27.597830Z",
     "start_time": "2020-06-20T19:41:27.588331Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "train_docs, test_docs = train_test_split(docs, \n",
    "                                         stratify=docs.topic, \n",
    "                                         test_size=50, \n",
    "                                         random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:27.608137Z",
     "start_time": "2020-06-20T19:41:27.598869Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((2175, 3), (50, 3))"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_docs.shape, test_docs.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:27.617832Z",
     "start_time": "2020-06-20T19:41:27.609175Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Sport            12\n",
       "Business         11\n",
       "Tech              9\n",
       "Entertainment     9\n",
       "Politics          9\n",
       "Name: topic, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series(test_docs.topic).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Vectorize train & test sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:28.002409Z",
     "start_time": "2020-06-20T19:41:27.618688Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<2175x2000 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 179073 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer = CountVectorizer(max_df=.2, \n",
    "                             min_df=3, \n",
    "                             stop_words='english', \n",
    "                             max_features=2000)\n",
    "\n",
    "train_dtm = vectorizer.fit_transform(train_docs.article)\n",
    "words = vectorizer.get_feature_names()\n",
    "train_dtm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:28.013400Z",
     "start_time": "2020-06-20T19:41:28.003251Z"
    },
    "scrolled": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<50x2000 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 3766 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_dtm = vectorizer.transform(test_docs.article)\n",
    "test_dtm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## LDA with gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### Using `CountVectorizer` Input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:28.293954Z",
     "start_time": "2020-06-20T19:41:28.014336Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "max_df = .2\n",
    "min_df = 3\n",
    "max_features = 2000\n",
    "\n",
    "# used by sklearn: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py\n",
    "stop_words = pd.read_csv('http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words', \n",
    "                         header=None).squeeze('columns').tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:28.688688Z",
     "start_time": "2020-06-20T19:41:28.294887Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "vectorizer = CountVectorizer(max_df=max_df, \n",
    "                             min_df=min_df, \n",
    "                             stop_words='english', \n",
    "                             max_features=max_features)\n",
    "\n",
    "train_dtm = vectorizer.fit_transform(train_docs.article)\n",
    "test_dtm = vectorizer.transform(test_docs.article)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert sklearn DTM to gensim data structures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gensim facilitates the conversion of the document-term matrix (DTM) produced by sklearn into gensim data structures as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:28.694502Z",
     "start_time": "2020-06-20T19:41:28.689527Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "train_corpus = Sparse2Corpus(train_dtm, documents_columns=False)\n",
    "test_corpus = Sparse2Corpus(test_dtm, documents_columns=False)\n",
    "id2word = pd.Series(vectorizer.get_feature_names()).to_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Train Model & Review Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:31.165214Z",
     "start_time": "2020-06-20T19:41:28.695922Z"
    }
   },
   "outputs": [],
   "source": [
    "LdaModel(corpus=train_corpus, \n",
    "         num_topics=100, \n",
    "         id2word=None, \n",
    "         distributed=False, \n",
    "         chunksize=2000,                   # Number of documents to be used in each training chunk.\n",
    "         passes=1,                         # Number of passes through the corpus during training\n",
    "         update_every=1,                   # Number of docs to be iterated through for each update\n",
    "         alpha='symmetric', \n",
    "         eta=None,                         # a-priori belief on word probability\n",
    "         decay=0.5,                        # percentage of previous lambda forgotten when new document is examined\n",
    "         offset=1.0,                       # controls how much the first few iterations are slowed down\n",
    "         eval_every=10,                    # estimate log perplexity\n",
    "         iterations=50,                    # Maximum number of iterations through the corpus\n",
    "         gamma_threshold=0.001,            # Minimum change in the value of the gamma parameters to continue iterating\n",
    "         minimum_probability=0.01,         # Topics with a probability lower than this threshold will be filtered out\n",
    "         random_state=None, \n",
    "         ns_conf=None, \n",
    "         minimum_phi_value=0.01,           # if `per_word_topics` is True, represents lower bound on term probabilities\n",
    "         per_word_topics=False,            # if True, compute a list of most likely topics for each word with phi values multiplied by word count\n",
    "         callbacks=None);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:31.168273Z",
     "start_time": "2020-06-20T19:41:31.166193Z"
    }
   },
   "outputs": [],
   "source": [
    "num_topics = 5\n",
    "topic_labels = ['Topic {}'.format(i) for i in range(1, num_topics+1)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:32.469679Z",
     "start_time": "2020-06-20T19:41:31.169486Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "lda_gensim = LdaModel(corpus=train_corpus,\n",
    "                      num_topics=num_topics,\n",
    "                      id2word=id2word)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:32.474247Z",
     "start_time": "2020-06-20T19:41:32.470681Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0,\n",
       " '0.006*\"england\" + 0.006*\"growth\" + 0.005*\"game\" + 0.005*\"chelsea\" + 0.004*\"players\" + 0.004*\"united\" + 0.004*\"market\" + 0.004*\"prices\" + 0.004*\"economy\" + 0.004*\"europe\"')"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "topics = lda_gensim.print_topics()\n",
    "topics[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Evaluate Topic Coherence\n",
    "\n",
    "Topic coherence measures whether the top-ranked words of a topic tend to occur in the same documents.\n",
    "\n",
    "- It adds up a score for each distinct pair of top ranked words. \n",
    "- The score is the log of the probability that a document containing at least one instance of the higher-ranked word also contains at least one instance of the lower-ranked word.\n",
    "\n",
    "Large negative values indicate words that don't co-occur often; values closer to zero indicate that words tend to co-occur more often."
   ]
  },
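  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pairwise score described above can be sketched on a toy document-term matrix. This is a minimal illustration, not gensim's implementation: the function `umass_coherence`, the smoothing constant `eps`, and the toy matrix are assumptions made for this example, and gensim may differ in details such as smoothing and how the pair scores are aggregated.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def umass_coherence(doc_term, top_word_ids, eps=1.0):\n",
    "    # True where a document contains the word at least once\n",
    "    present = np.asarray(doc_term) > 0\n",
    "    score = 0.0\n",
    "    for m in range(1, len(top_word_ids)):\n",
    "        for l in range(m):\n",
    "            hi, lo = top_word_ids[l], top_word_ids[m]  # hi is ranked above lo\n",
    "            both = np.sum(present[:, hi] & present[:, lo])\n",
    "            # smoothed log share of documents with hi that also contain lo\n",
    "            score += np.log((both + eps) / np.sum(present[:, hi]))\n",
    "    return score\n",
    "\n",
    "# toy document-term matrix: 4 documents (rows), 3 words (columns)\n",
    "toy_dtm = np.array([[1, 2, 0],\n",
    "                    [3, 1, 0],\n",
    "                    [0, 0, 4],\n",
    "                    [1, 0, 1]])\n",
    "\n",
    "print(umass_coherence(toy_dtm, [0, 1]))\n",
    "print(umass_coherence(toy_dtm, [0, 2]))\n",
    "```\n",
    "\n",
    "Words 0 and 1 share two of the three documents that contain word 0, whereas words 0 and 2 share only one, so the second pair receives the lower (more negative) score."
   ]
  },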
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:32.558704Z",
     "start_time": "2020-06-20T19:41:32.475420Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "coherence = lda_gensim.top_topics(corpus=train_corpus, coherence='u_mass')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gensim's `top_topics` method returns, for each topic, its most important words together with the topic's coherence score: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:32.669199Z",
     "start_time": "2020-06-20T19:41:32.559671Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Topic 1          Topic 2         Topic 3          Topic 4           Topic 5         \n",
      "     prob     term    prob    term    prob     term    prob      term    prob     term\n",
      "0   0.63%  england   0.59%    home   0.75%     film   0.92%    labour   0.58%  england\n",
      "1   0.62%     bank   0.55%    film   0.53%   mobile   0.90%     music   0.57%   growth\n",
      "2   0.55%    wales   0.44%   china   0.50%   market   0.82%  election   0.53%     game\n",
      "3   0.42%  company   0.44%    best   0.46%  players   0.76%     brown   0.45%  chelsea\n",
      "4   0.39%      tax   0.43%  labour   0.42%      old   0.71%     blair   0.44%  players\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAr4AAAESCAYAAAAbsdZ9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAPdklEQVR4nO3da4hVddvA4Xt01DGNTOZTar0RlFJESmhBBQlBKCIZO4wny0oqo6jGsgNEYjYZFVSIZdFBo0wnrQ9FgZ2gsoyCiqIDKYLYhyQ7jJU5zcz7QV4HafX4zm7vvXbd1wWBszfj3Cxu4sffNWu39Pf39wcAAPzLDSl7AAAAaAThCwBACsIXAIAUhC8AACkIXwAAUhC+AACk0FrNN/X19cWSJUviq6++iuHDh8eyZcvimGOOqfVsAABQM1Wd+L722muxb9++WLduXSxatCiWL19e67kAAKCmqgrfjz76KM4888yIiDjllFPis88+q+lQAABQa1WF7549e2L06NEHvh46dGj88ccfNRsKAABqrap7fEePHh2//PLLga/7+vqitfW//1XTpk2LcePGVfPjaqa/P6KlpdQRmoZrMcC1GOBaDHAtBrgWA1yLAa7FANdiQDNci507d8aWLVsK36sqfKdMmRJvvvlmzJgxIz7++OM4/vjjD/k948aNi40bN1bz42rqf255uewRmsL25TPLHqGp2Iv97MXB7MV+9uJg9mI/e3Ewe7FfM+zFnDlz/vK9qsL3nHPOiXfffTfmzp0b/f390dnZWfVwAADQCFWF75AhQ2Lp0qW1ngUAAOrGB1gAAJCC8AUAIAXhCwBACsIXAIAUhC8AACkIXwAAUhC+AACkIHwBAEhB+AIAkILwBQAgBeELAEAKwhcAgBSELwAAKQhfAABSEL4AAKQgfAEASEH4AgCQgvAFACAF4QsAQArCFwCAFIQvAAApCF8AAFIQvgAApCB8AQBIQfgCAJCC8AUAIAXhCwBACsIXAIAUhC8AACkIXwAAUhC+AACkIHwBAEhB+AIAkILwBQAgBeELAEAKwhcAgBSELwAAKfyt8N20aVMsWrSoVrMAAEDdtFb7jcuWLYt33nknJk2aVMt5AACgLqo+8Z0yZUosWbKkhqMAAED9HPLEt6urK1avXn3Qa52dnTFjxozYsmVL3QYDAIBaOmT4ViqVqFQqjZgFAADqxlMdAABIQfgCAJBC1U91iIiYNm1aTJs2rVazAABA3TjxBQAgBeELAEAKwhcAgBSELwAAKQhfAABSEL4AAKQgfAEASEH4AgCQgvAFACAF4QsAQArCFwCAFIQvAAApCF8AAFIQvgAApCB8AQBIQfgCAJCC8AUAIAXhCwBACsIXAIAUhC8AACkIXwAAUhC+AACkIHwBAEhB+AIAkILwBQAgBeELAEAKwhcAgBSELwAAKQhfAABSEL4AAKQgfAEASEH4AgCQgvAFACAF4QsAQArCFwCAFIQvAAAptFbzTd3d3XHTTTfFnj17oqenJ2655ZaYPHlyrWcDAICaqSp8n3zyyTjttNNi/vz5sW3btli0aFG88MILtZ4NAABqpqrwnT9/fgwfPjwiInp7e2PEiBE1HQoAAGrtkOHb1dUVq1evPui1zs7OOPnkk2PXrl1x0003xW233Va3AQEAoBYOGb6VSiUqlcqfXv/qq6+io6MjFi9eHFOnTq3LcAAAUCtV3erwzTffxHXXXRcPPPBATJw4sdYzAQBAzVUVvvfff3/s27cv7rrrroiIGD16dDz88MM1HQwAAGqpqvAVuQAA/NP4AAsAAFIQvgAApCB8AQBIQfgCAJCC8AUAIAXhCwBACsIXAIAUhC8AACkIXwAAUhC+AACkIHwBAEhB+AIAkILwBQAgBeELAEAKwhcAgBSELwAAKQhfAABSEL4AAKQgfAEASEH4AgCQgvAFACAF4QsAQArCFwCAFIQvAAApCF8A
AFIQvgAApCB8AQBIQfgCAJCC8AUAIAXhCwBACsIXAIAUhC8AACkIXwAAUhC+AACkIHwBAEhB+AIAkEJr2QM00t6e3ti+fGbZYzSFvT290TZsaNljAAA0TFXh++uvv8aiRYvip59+ipEjR8a9994bY8eOrfVsNSf0BrgWAEA2Vd3qsH79+jjxxBPj2WefjZkzZ8bKlStrPRcAANRUVSe+8+fPj97e3oiI+Pbbb6O9vb2mQ0EjuQVmgFtgAPg3O2T4dnV1xerVqw96rbOzM04++eS4+OKL4+uvv44nn3yybgNCvQm9Aa4FAP9mhwzfSqUSlUql8L01a9bE1q1b48orr4zXXnut5sMBAECtVHWP76pVq+LFF1+MiIjDDjsshg51SgQAQHOr6h7f888/P26++ebYsGFD9Pb2RmdnZ63nAgCAmqoqfNvb2+Pxxx+v9SwAAFA3PrkNAIAUhC8AACmk+shigP8vz3ce4PnOwL+FE1+AAkJvgGsB/FsIXwAAUhC+AACkIHwBAEhB+AIAkILwBQAgBeELAEAKwhcAgBSELwAAKQhfAABSEL4AAKQgfAEASEH4AgCQgvAFACAF4QsAQArCFwCAFIQvAAApCF8AAFIQvgAApCB8AQBIQfgCAJCC8AUAIAXhCwBACsIXAIAUhC8AACkIXwAAUhC+AACkIHwBAEhB+AIAkILwBQAgBeELAEAKwhcAgBSELwAAKbT+nW/eunVrXHDBBbF58+YYMWJErWYCgKa0t6c3ti+fWfYYTWFvT2+0DRta9hgwKFWf+O7ZsyfuueeeGD58eC3nAYCmJfQGuBb8E1UVvv39/XH77bdHR0dHjBw5stYzAQBAzR3yVoeurq5YvXr1Qa8dddRRMWPGjJg4cWLdBgMAgFo6ZPhWKpWoVCoHvXbOOefEhg0bYsOGDbFr16647LLL4plnnqnbkAAA8HdV9cttmzZtOvDn6dOnxxNPPFGzgQAAoB48zgwAgBT+1uPMIiLeeOONWswBAAB15cQXAIAUhC8AACkIXwAAUhC+AACkIHwBAEhB+AIAkILwBQAgBeELAEAKwhcAgBSELwAAKQhfAABSEL4AAKQgfAEASEH4AgCQgvAFACAF4QsAQArCFwCAFIQvAAApCF8AAFIQvgAApCB8AQBIQfgCAJCC8AUAIAXhCwBACsIXAIAUWsseAADgn2xvT29sXz6z7DGawt6e3mgbNrTsMf6SE18AgL+hmUOv0Zr9WghfAABSEL4AAKQgfAEASEH4AgCQgvAFACAF4QsAQArCFwCAFIQvAAApNOyT23bu3Blz5sxp1I8DACChnTt3/uV7Lf39/f0NnAUAAErhVgcAAFIQvgAApCB8AQBIQfgCAJCC8AUAIAXhCwBACsIXStDd3R2//fbbQa/9t+cOks+OHTvsBH/y5Zdflj0CTWb37t3x0UcfxY8//lj2KP8IwhcarKurK84///yYNWtWPPbYYwdev/XWW0ucirJ9+umnMXv27Lj88svjxRdfjIULF8Y111wTXV1dZY9Gid55552D/rv55psP/Jm8rrjiioiIeOutt+LCCy+Mp59+Oi666KJ44403Sp6s+TXsk9symjVrVvzwww+F7/mfVl7r16+Pl156KSL2x+4jjzwSV111Vfgsmdw6Oztj5cqVsXPnzli4cGG8/fbbMWzYsJg3b15UKpWyx6Mk9913XwwZMiROOOGEiIj4/vvv4+WXX46IiDPOOKPM0SjR3r17IyLisccei7Vr18bYsWPjl19+iQULFsT06dNLnq65Cd86WrFiRXR0dMQzzzwTbW1tZY9Dkxg6dGgMHz48IiLuueeeWLBgQYwfPz5aWlpKnowy9fX1xbhx42LcuHFx0UUXxWGHHRYRYS+SW7t2bSxdujSmTJkSlUol5s2bF3fffXfZY1GyP/74IyIiDj/88BgzZkxERIwaNSr6+vrKHOsfwa0OdXTMMcfExRdfHFu2bCl7FJrIlClT4tprr43u7u5o
bW2Nhx56KJ544gn37iV3+umnx6WXXhp9fX1xww03RETE0qVLD5z0kdPIkSPj7rvvju7u7rjjjjuit7e37JFoAkcccUTMnDkzPv/881izZk389ttvceWVV8Ypp5xS9mhNr6Xfv69Cw23ZsiUmT5584OT3999/j7Vr18b8+fPLHYxSffHFFzFp0qQDX7///vsxderUGDLEGQUR7733Xjz//PNx//33lz0KTeL777+Pnp6eaG9vj82bN8dZZ51V9khNT/gCAJCCYwQAAFIQvg3yf89s/e6770qehGZiLyhiLyhiLyhiLwZH+DbAihUr4sEHH4yIiGXLlsWjjz5a8kQ0A3tBEXtBEXtBEXsxeO7xbYA5c+bExo0bD3w9d+7ceO6550qciGZgLyhiLyhiLyhiLwbPiW8DtLS0xL59+yIioqenxwcVEBH2gmL2giL2giL2YvB8gEUDzJ07N2bNmhXHH398bNu2LRYsWFD2SDQBe0ERe0ERe0ERezF4bnVokN27d8eOHTtiwoQJMXbs2LLHoUnYC4rYC4rYC4rYi8Fx4ltHK1eujKuvvjo6Ojr+9LGjHkCel72giL2giL2giL2onvCto+nTp0fE/n+KiNh/L44DduwFRewFRewFRexF9YYuWbJkSdlD/Fu1t7dHRERbW1usW7cuXn/99fj555/j3HPPjba2tpKnoyz2giL2giL2giL2onqe6tAA119/fRx33HFx4403xvjx42Px4sVlj0QTsBcUsRcUsRcUsReD51aHBrnwwgsjImLixInx6quvljwNzcJeUMReUMReUMReDI5bHRrgk08+ie7u7hgzZkx88MEHsX379pg0aVL8+OOPceSRR5Y9HiWxFxSxFxSxFxSxF4PncWYNMG/evIj4883nLS0tsWbNmrLGomT2giL2giL2giL2YvCEb4P88MMPsWPHjhg/frzn7HGAvaCIvaCIvaCIvRgctzo0wCuvvBI33nhjbNu2LVatWhVHHHFETJw4seyxKJm9oIi9oIi9oIi9GDy/3NYATz31VGzcuDFGjRoVe/bsiUsuuSRmz55d9liUzF5QxF5QxF5QxF4MnseZNUBLS0uMGjUqIiJGjx4dI0aMKHkimoG9oIi9oIi9oIi9GDwnvg1w9NFHx/Lly+PUU0+NDz/8MI4++uiyR6IJ2AuK2AuK2AuK2IvBc+JbR9dff31ERHR2dsaECRNi8+bNMWHChLjzzjtLnowy2QuK2AuK2AuK2IvqOfGto927d0dERGtra/znP/8peRqahb2giL2giL2giL2onseZ1dHZZ58ds2bNKnyvo6OjwdPQLOwFRewFRewFRexF9Zz41lFbW1sce+yxZY9Bk7EXFLEXFLEXFLEX1RO+ddTe3h7nnXde2WPQZOwFRewFRewFRexF9fxyWx2ddNJJZY9AE7IXFLEXFLEXFLEX1XOPLwAAKTjxBQAgBeELAEAKwhcAgBSELwAAKQhfAABS+F/X6+aeEJeCZgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 864x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "topic_coherence = []\n",
    "topic_words = pd.DataFrame()\n",
    "for t in range(len(coherence)):\n",
    "    label = topic_labels[t]\n",
    "    topic_coherence.append(coherence[t][1])\n",
    "    df = pd.DataFrame(coherence[t][0], columns=[(label, 'prob'), (label, 'term')])\n",
    "    df[(label, 'prob')] = df[(label, 'prob')].apply(lambda x: '{:.2%}'.format(x))\n",
    "    topic_words = pd.concat([topic_words, df], axis=1)\n",
    "                      \n",
    "topic_words.columns = pd.MultiIndex.from_tuples(topic_words.columns)\n",
    "pd.set_option('expand_frame_repr', False)\n",
    "topic_words.head().to_csv('topic_words.csv', index=False)\n",
    "print(topic_words.head())\n",
    "\n",
    "pd.Series(topic_coherence, index=topic_labels).plot.bar(figsize=(12,4));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Using `gensim` `Dictionary` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:34.822786Z",
     "start_time": "2020-06-20T19:41:32.670143Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "docs = [d.split() for d in train_docs.article.tolist()]\n",
    "docs = [[t for t in doc if t not in stop_words] for doc in docs]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:35.410499Z",
     "start_time": "2020-06-20T19:41:34.824589Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "dictionary = Dictionary(docs)\n",
    "dictionary.filter_extremes(no_below=min_df, no_above=max_df, keep_n=max_features)"
   ]
  },
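  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the rule that `filter_extremes` applies, the following pure-Python function keeps tokens that occur in at least `no_below` documents and in at most a `no_above` share of all documents, and then retains the `keep_n` tokens with the highest document frequency. The helper `filter_by_doc_freq`, the toy documents, and the alphabetical tie-breaking are assumptions for illustration only.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def filter_by_doc_freq(docs, no_below, no_above, keep_n):\n",
    "    # document frequency: number of documents that contain each token\n",
    "    dfs = Counter(token for doc in docs for token in set(doc))\n",
    "    max_docs = int(no_above * len(docs))\n",
    "    kept = [t for t, n in dfs.items() if no_below <= n <= max_docs]\n",
    "    # most frequent first; ties broken alphabetically (an assumption)\n",
    "    return sorted(kept, key=lambda t: (-dfs[t], t))[:keep_n]\n",
    "\n",
    "toy_docs = [['bank', 'tax', 'growth'],\n",
    "            ['bank', 'film'],\n",
    "            ['bank', 'tax', 'film'],\n",
    "            ['bank', 'growth', 'tax']]\n",
    "\n",
    "print(filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75, keep_n=2))\n",
    "```\n",
    "\n",
    "Note that `no_below` is an absolute document count whereas `no_above` is a fraction of the corpus, mirroring the `min_df=3` and `max_df=.2` settings used for `CountVectorizer`."
   ]
  },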
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:35.575497Z",
     "start_time": "2020-06-20T19:41:35.411526Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "corpus = [dictionary.doc2bow(doc) for doc in docs]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:35.578635Z",
     "start_time": "2020-06-20T19:41:35.576446Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of unique tokens: 2000\n",
      "Number of documents: 2175\n"
     ]
    }
   ],
   "source": [
    "print('Number of unique tokens: %d' % len(dictionary))\n",
    "print('Number of documents: %d' % len(corpus))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:35.589256Z",
     "start_time": "2020-06-20T19:41:35.579591Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "num_topics = 5\n",
    "chunksize = 500\n",
    "passes = 20\n",
    "iterations = 400\n",
    "eval_every = None # Don't evaluate model perplexity, takes too much time.\n",
    "\n",
    "temp = dictionary[0]  # This is only to \"load\" the dictionary.\n",
    "id2word = dictionary.id2token"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:50.290756Z",
     "start_time": "2020-06-20T19:41:35.590154Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "model = LdaModel(corpus=corpus,\n",
    "                 id2word=id2word,\n",
    "                 chunksize=chunksize,\n",
    "                 alpha='auto',\n",
    "                 eta='auto',\n",
    "                 iterations=iterations,\n",
    "                 num_topics=num_topics,\n",
    "                 passes=passes, \n",
    "                 eval_every=eval_every)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:50.294650Z",
     "start_time": "2020-06-20T19:41:50.291650Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.010*\"mobile\" + 0.009*\"use\" + 0.009*\"technology\" + 0.007*\"digital\" + 0.007*\"music\" + 0.007*\"phone\" + 0.007*\"used\" + 0.007*\"users\" + 0.006*\"online\" + 0.006*\"software\"'),\n",
       " (1,\n",
       "  '0.009*\"game\" + 0.008*\"England\" + 0.008*\"win\" + 0.007*\"good\" + 0.006*\"players\" + 0.006*\"play\" + 0.006*\"world\" + 0.006*\"think\" + 0.005*\"it\\'s\" + 0.005*\"second\"'),\n",
       " (2,\n",
       "  '0.008*\"market\" + 0.008*\"growth\" + 0.007*\"company\" + 0.007*\"economic\" + 0.006*\"sales\" + 0.006*\"firm\" + 0.005*\"economy\" + 0.005*\"chief\" + 0.005*\"oil\" + 0.005*\"prices\"'),\n",
       " (3,\n",
       "  '0.012*\"government\" + 0.011*\"Labour\" + 0.009*\"Blair\" + 0.008*\"election\" + 0.007*\"public\" + 0.007*\"Brown\" + 0.006*\"minister\" + 0.006*\"say\" + 0.006*\"BBC\" + 0.005*\"party\"'),\n",
       " (4,\n",
       "  '0.021*\"film\" + 0.019*\"best\" + 0.011*\"music\" + 0.008*\"British\" + 0.008*\"won\" + 0.007*\"including\" + 0.007*\"UK\" + 0.006*\"director\" + 0.006*\"BBC\" + 0.006*\"band\"')]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.show_topics()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating Topic Assignments on the Test Set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:50.377947Z",
     "start_time": "2020-06-20T19:41:50.295592Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "docs_test = [d.split() for d in test_docs.article.tolist()]\n",
    "docs_test = [[t for t in doc if t not in stop_words] for doc in docs_test]\n",
    "\n",
    "test_dictionary = Dictionary(docs_test)\n",
    "test_dictionary.filter_extremes(no_below=min_df, no_above=max_df, keep_n=max_features)\n",
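    "# Encode the test documents with the dictionary fitted on the training set so that token ids match the model's vocabulary.\n",
    "test_corpus = [dictionary.doc2bow(doc) for doc in docs_test]"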
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:50.402241Z",
     "start_time": "2020-06-20T19:41:50.378889Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.07</td>\n",
       "      <td>25.66</td>\n",
       "      <td>0.10</td>\n",
       "      <td>0.09</td>\n",
       "      <td>74.47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>29.10</td>\n",
       "      <td>0.10</td>\n",
       "      <td>3.22</td>\n",
       "      <td>5.93</td>\n",
       "      <td>0.07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>55.59</td>\n",
       "      <td>2.44</td>\n",
       "      <td>11.77</td>\n",
       "      <td>4.53</td>\n",
       "      <td>0.07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>23.22</td>\n",
       "      <td>0.10</td>\n",
       "      <td>59.93</td>\n",
       "      <td>0.09</td>\n",
       "      <td>0.07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.07</td>\n",
       "      <td>33.09</td>\n",
       "      <td>5.10</td>\n",
       "      <td>0.09</td>\n",
       "      <td>0.07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>34.79</td>\n",
       "      <td>0.10</td>\n",
       "      <td>40.08</td>\n",
       "      <td>26.69</td>\n",
       "      <td>3.74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.07</td>\n",
       "      <td>0.10</td>\n",
       "      <td>59.90</td>\n",
       "      <td>0.09</td>\n",
       "      <td>5.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.07</td>\n",
       "      <td>29.09</td>\n",
       "      <td>0.10</td>\n",
       "      <td>0.09</td>\n",
       "      <td>0.07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>9.29</td>\n",
       "      <td>0.10</td>\n",
       "      <td>72.86</td>\n",
       "      <td>0.09</td>\n",
       "      <td>0.07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1.65</td>\n",
       "      <td>0.10</td>\n",
       "      <td>39.69</td>\n",
       "      <td>17.90</td>\n",
       "      <td>0.07</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      0     1     2     3     4\n",
       "0  0.07 25.66  0.10  0.09 74.47\n",
       "1 29.10  0.10  3.22  5.93  0.07\n",
       "2 55.59  2.44 11.77  4.53  0.07\n",
       "3 23.22  0.10 59.93  0.09  0.07\n",
       "4  0.07 33.09  5.10  0.09  0.07\n",
       "5 34.79  0.10 40.08 26.69  3.74\n",
       "6  0.07  0.10 59.90  0.09  5.26\n",
       "7  0.07 29.09  0.10  0.09  0.07\n",
       "8  9.29  0.10 72.86  0.09  0.07\n",
       "9  1.65  0.10 39.69 17.90  0.07"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
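   ],
   "source": [
    "# Infer the variational topic weights (gamma) for the held-out documents;\n",
    "# each row is one document, each column one topic, and rows are not yet normalized\n",
    "gamma, _ = model.inference(test_corpus)\n",
    "topic_scores = pd.DataFrame(gamma)\n",
    "topic_scores.head(10)"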
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:50.408824Z",
     "start_time": "2020-06-20T19:41:50.403086Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.00</td>\n",
       "      <td>0.26</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.76</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.08</td>\n",
       "      <td>0.15</td>\n",
       "      <td>0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.75</td>\n",
       "      <td>0.03</td>\n",
       "      <td>0.16</td>\n",
       "      <td>0.06</td>\n",
       "      <td>0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.28</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.72</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.00</td>\n",
       "      <td>0.86</td>\n",
       "      <td>0.13</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2    3    4\n",
       "0 0.00 0.26 0.00 0.00 0.74\n",
       "1 0.76 0.00 0.08 0.15 0.00\n",
       "2 0.75 0.03 0.16 0.06 0.00\n",
       "3 0.28 0.00 0.72 0.00 0.00\n",
       "4 0.00 0.86 0.13 0.00 0.00"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Normalize each row to sum to 1 to obtain per-document topic probabilities\n",
    "topic_probabilities = topic_scores.div(topic_scores.sum(axis=1), axis=0)\n",
    "topic_probabilities.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:50.417489Z",
     "start_time": "2020-06-20T19:41:50.409628Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    4\n",
       "1    0\n",
       "2    0\n",
       "3    2\n",
       "4    1\n",
       "dtype: int64"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Most likely topic for each test document\n",
    "topic_probabilities.idxmax(axis=1).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T19:41:50.565800Z",
     "start_time": "2020-06-20T19:41:50.418439Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZoAAAEICAYAAABmqDIrAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3deXxU5aHG8d9MSCAQ2VdZJAFRULFSkHARUQgG2lKXAllgSixXTIUiKVxZwiaRrYhBwYRAVSRsA4gUVOq9gFetFKRAQRFJ2RtCKWExBEgySeb+oc4VFZjAnJyTk+fbz3w+nuTMmWdizZP3Pe+c4/B6vV5EREQM4jQ7gIiI2JuKRkREDKWiERERQ6loRETEUCoaERExlIpGREQMpaIREZHr2rNnDy6XC4D9+/cTHx+Py+ViyJAh5ObmXvO5KhoREbmmRYsWMWHCBAoLCwGYNm0aEydOJDMzk169erFo0aJrPl9FIyIi19SiRQvmzZvn237ppZdo27YtACUlJVStWvWaz69iaDqRCii0T6rZEQxxeWOS2RGkHIXeN9yv/RaP7Ybb7fZtx8TEEBMTc8U+0dHRZGdn+7YbNmwIwK5du1i6dCnLli275muoaEREKrEfKxZ/vPfee6Snp7Nw4ULq1q17zX1VNCIiduQw7szIn/70J9xuN5mZmdSuXfu6+6toRETsyBlkyGFLSkqYNm0aTZo04Xe/+x0AnTp1YsSIEVd9jopGRMSOHI6AHq5Zs2asWrUKgE8//bRMz1XRiIjYkYFTZ2WlohERsaMAj2huhopGRMSONKIRERFDaUQjIiKGMmjV2Y1Q0YiI2JGmzkRExFCaOhMREUNpRCMiIoZS0YiIiKGCtBhARESMpHM0IiJiKE2diYiIoSw0orFO5YlUEp3uaMz7s/r5tn/5H61Y/FwfExOJLTmc/j3KQYUqmu3bt9OlSxdcLheDBg0iNjaWQ4cOlekYw4f7d3tTESP8vl9H0p7tRbWQrycTXnz6IaYmPIDTaZ2/PsUmHA7/HuWgQhUNQGRkJJmZmSxdupThw4fzhz/8oUzPnz9/vkHJRK7v8MnzxL6wwbe9bX8OI+ZvNjGR2JYzyL9HeUQpl1cxSF5eHk2bNsXlcvlGNitWrGDevHkUFhaSmJjIoEGD6NevH9u3bwega9euALhcLqZNm0ZCQgL9+vXjxIkTAGRmZhITE0NsbCxLliwB4L//+7/p378/cXFxjB49mtLSUnbu3MmAAQOIj48nMTGR/Px8E34CUtGs++QgnuJS3/aaj7Lwek0MJPZloamzCrcYYNu2bbhcLoqKijhw4AAZGRn84x//+MF+x48fJzc3l8WLF3PmzBmOHj36g33at29PcnIyqampvPvuu/To0YP33nuP5cuX43A4SEhI4IEHHuCdd94hISGBn//856xbt478/Hw2bdpEr169GDJkCFu2bCEvL4+wsLBy+AmIiPjBQosBKlzRREZGkpqaCsDhw4eJjY3ltttu833f+82fh7fffjsDBw7k97//PcXFxbhcrh8cq127dgA0btyY3NxcsrKyyMnJISEhAYCvvvqK48ePM27cODIyMlixYgURERFERUWRmJjIggULGDx4MI0aNaJ9+/YGv3MRkTKw0PJm6yS5AfXr1wegZs2anD59GoAvvvgCgAMHDnDx4kUWLlzIzJkzSUlJue7xIiIiaN26NUuWLCEzM5MnnniCNm3a4Ha7+d3vfsfSpUsB+J//+R82bNjA448/TmZmJrfffrvvXtoiIpagqbMb9+3UmdPp5OLFi4wdO5Z69eoxdepUmjRpQsOGDQFo2bIlr776KuvWrSM4OJgRI0Zc99h33nknXbp0IS4ujqKiItq3b+8brTz55JPUrl2bGjVq8NBDD3H8+HHGjh1L9erVCQ4OZurUqUa/dbGJ4//Oo3vSSt/2x59l8/Fn2SYmEluy0P1oHF6vTkWK
fFdon1SzIxji8sYksyNIOQp9bKFf+11eN9TgJBVwRCMiIn6w0DkaFY2IiB1p1ZmIiBjJoaIREREjqWhERMRQDgtdP09FIyJiQxrRiIiIoVQ0IiJiKBWNiIgYyzo9o6IREbEjK41orPPRURERCRin0+nXw1979uzxXQX/2LFjxMXFER8fz+TJkyktLb3mc1U0IiI25HA4/Hr4Y9GiRUyYMIHCwkIAZsyYwciRI1m+fDler5fNm699l1gVjYiIHTn8fPihRYsWzJs3z7e9b98+7r//fgAefPBBtm7des3n6xyNiIgN+TtacbvduN1u33ZMTAwxMTFX7BMdHU129v/fysLr9fqOX6NGDS5cuHDN11DRiIjYkL9F82PFcj3fPbdz8eJFatasee39y3R0ERGpEBxOh1+PG9GuXTu2b98OwEcffUTHjh2vub9GNCLfc26DPW8QVlBsdoLAe8q9x+wIhsgceO9NH8PI5c1jxoxh4sSJvPTSS0RERBAdHX3N/VU0IiI2FOiiadasGatWrQIgPDycpUuX+v1cFY2IiA1Z6QObKhoRERtS0YiIiLGs0zMqGhEROyrL5WWMpqIREbEhTZ2JiIixrNMzKhoRETvSiEZERAylohEREUOpaERExFA3eh0zI6hoRERsSCMaERExlIpGREQMZaGeUdGIiNiRRjQiImIopxYDiIiIkSw0oNGtnEXMsnfvHoYkuMyOEXB2e1/dIuowPqoV46NaMTm6Na/F3kP1YOv/6nQ6HX49yoMhI5rt27czcuRIWrdu7ftanTp1eOWVV350f7fbzRNPPEFwcPB1j11YWMj69evp37//VfdJSkpi1qxZhISElD18GZ0/f56PP/6Yvn37Gv5aYh9vvLaIdzasJzQ01OwoAWXH9/Xx4XN8fPgcAIM7NeWjQ2e55Ck1OdX1VYoRTWRkJJmZmb7H1UoGICMjg9JS//7FnT59mtWrV19zn9TU1HIpGYADBw6wZcuWcnktsY/mzVvw0svzzI4RcHZ9XwDhdUNpWqsaHxw8a3YUvzgcDr8e5aFcz9G4XC7uvPNO/vGPf5Cfn8/LL7/M1q1bOX36NElJSaSlpTFnzhx27NiB1+slISGBPn364HK5qFOnDnl5eTRr1oyDBw8yf/58+vXrx5QpUygsLOT8+fMMGzaMqKgoevTowcaNG5k8eTIhISGcOHGCf//738ycOZO77rqLXr16cd9993Hs2DEiIyO5cOECe/fuJTw8nNmzZ3Py5EkmTpxIYWEhVatWJSUlhZKSEkaNGkXjxo355z//yT333MPzzz/PggUL+PLLL3G73cTExJTnj1MqsKhHojlxItvsGAFn1/cF0Peuhrz92b/MjuE3K41oDCuabdu24XL9/zxt9+7dAWjfvj3Jycmkpqby7rvvMnToUNLT00lNTeXDDz8kOzublStXUlhYyIABA+jatSsAffv2pVevXmRnZ5OVlcXw4cPZunUrTz75JJ07d2bXrl3MmzePqKioK3LceuutTJ06lVWrVuF2u5k6dSonTpzgzTffpEGDBtx///2sXr2aiRMn0rNnT/Ly8pg1axYul4vu3bvz17/+lRdffJGkpCSOHj3Ka6+9RmhoKFFRUZw+fZrExERWrlypkhGxserBTm6tVY39py6aHcVvleLGZ5GRkaSmpl7xtQ8//JB27doB0LhxY3Jzc6/4flZWFvv27fMVVHFxMTk5OQCEh4f/4DUaNGhAeno6a9asweFwUFxc/IN92rZt63u9Xbt2AVC7dm1uvfVWAKpXr+47l3TLLbdQWFhIVlYWGRkZ/PGPf8Tr9frOHbVo0YKwsDDfaxcWFt7AT0ZEKpo7Goax7+QFs2OUSaUY0ZSFw+GgtLSUiIgIOnfuTEpKCqWlpaSlpdGsWTPfPvB1S397Pufll1+mf//+dO/enbfeeou33377R4/tz9e+KyIigt/85jd06NCBQ4cOsWPHjqs+77t5RMSemtSsyr/zi8yOUSaV4gOb
3586AygoKPjRfTt27MjQoUNZsmQJn376KfHx8Vy6dImoqCjfCOJb9erVw+PxMHv2bHr37s20adPIyMigSZMmnDt3LiDZx4wZ4zv3U1BQQHJy8lX3bdGiBVlZWSxevJiEhISAvL5UDk2bNmPpilVmxwg4O76v9/afNjtCmVmoZ3B4vV6v2SFErKTghzOwYlFPufeYHcEQmQPvvelj/DTlA7/22znx4Zt+reuxxNSZiIgElpVGNCoaEREb0rXORETEUJViMYCIiJjHQj2johERsSONaERExFAW6hkVjYiIHQVqMYDH42Hs2LGcOHECp9NJSkoKrVq1KluWgCQRERFLCdTVmz/88EOKi4tZuXIlw4YNY+7cuWXOohGNiIgNBeocTXh4OCUlJZSWlpKfn0+VKmWvDRWNiIgN+dszbrcbt9vt246JibniavTVq1fnxIkT9OnTh3PnzrFgwYIyZ1HRiIjYkL8jmu8Xy/ctXryYBx54gFGjRnHy5EkGDx7Mhg0bqFq1qt9ZVDQiIjYUqFVnNWvW9N0qpVatWhQXF1NSUlKmY6hoRERsKFCrzhISEhg/fjzx8fF4PB6SkpKoXr16mY6hohERsSFngIY0NWrU4OWXX76pY6hoRERsSB/YFBERQ+kSNCIiYigL3SVARSNSGdyb/L7ZEQxzf/vGZkewJN2PRsTCqum/igolELc9tiMHKhoRETGQhQY0KhoRETvSYgARETGUhXpGRSMiYkeB+sBmIKhoRERsSKvORETEUBYa0KhoRETsSFNnIiJiKOvUjIpGRMSWtLxZREQMZaG1ACoaERE7qrCrzjwej++WniIiYl1WmjpzXm+HVatWMX36dACefvpp1q1bZ3goERG5OU6Hf49yyXK9HVasWMGoUaMAyMjIYMWKFYaHEhGRm+NwOPx6lIfrTp05nU6qVq0KQHBwsKWGYyIi8uOs9Jv6ukXTs2dP4uPjad++Pfv27aNHjx7lkUtERG5CUEVaDPDMM8/w8MMPc+TIER577DHuvPPO8sglIhVMcJCDGf3voXm9UPILipm6bj/HzlwyO1alZaXZp6ueo1m9ejUAc+bMYePGjXz55Ze89957vPTSS+UW7kZs376dLl264HK5cLlcDBgwgMzMzB/dNzs7mwEDBgCQlJREUVEROTk5bNmyBYBp06aRk5NTbtlFKrIB9zfnUlExMa9u54U/7WfiY23NjlSpORz+PcrDVUc0jRt/fR/uiIiIb0I78Hq95ZPqJkVGRpKamgpAUVERvXv35tFHH6VmzZpXfc63+2/bto3Dhw/To0cPkpOTyyWviB20blSDjw7kAnAk9xKtGtYwOVHlZqVrnV11RNOtWzcAevfuzVdffcXu3bu5ePEiv/jFL8otXCDk5+fjdDrJysoiLi6OQYMGMWTIkB+MVHr06MGlS5dYuHAh77zzDps3b8blcnHo0CHOnDnDU089RWxsLDExMRw9epSdO3cyYMAA4uPjSUxMJD8/36R3KGIN+3Mu8HDbBgDc26IWjWpWs9Sn0yubCjGi+daoUaOIiIigW7du7Nq1i3HjxvHiiy+WR7Ybtm3bNlwuFw6Hg+DgYCZOnMj06dOZNm0abdu2ZdOmTcycOZPnnnvuiucFBQUxdOhQDh8+TM+ePVm8eDEA6enp9OjRg7i4OP7617+yd+9e9u/fT69evRgyZAhbtmwhLy+PsLAwE96tiDW89bcTtGpYgyVDO7Hr2Dn2ncijtGJMgtiSlc7RXLdozp8/z+jRowGIiooiPj7e8FA367tTZ99KTk6mbduv54w7derEnDlz/D7ekSNH6NevHwBdunQBoHv37ixYsIDBgwfTqFEj2rdvH6D0IhXTPc1qsvPoeWa8c4C7m9akRb3qZkeq1IIsVDTX/cBm69at2blzJwAHDhzg1ltvxePxUFRUZHi4QGrYsCFffvklADt27KBly5Y/up/T6aS0tPSKr7Vq1YrPPvvM99zZs2ezYcMGHn/8cTIzM7n9
9ttZtWqVoflFrO5Y7iXiIpuz8pnOPBvdmpkbDpgdqVKz0pUBrjui2blzJ3/5y18IDg7G4/EAEB0djcPhYPPmzYYHDJQXXniBlJQUvF4vQUFBvsvqfF+bNm1IT0/nrrvu8n0tMTGR8ePHs379egCmT5/O2bNnGTt2LNWrVyc4OJipU6eWy/sQsapzlzw8+ce/mR1DvmGl82MOrx9LybxeL2fPnqVOnTo4ndcdBImIxdwx5n2zIxjmwKxosyNY0ig/R5Rz+t5hcBI/ps62b99OVFQUQ4YMISoqik8++cTwUCIicnMq1NTZ3LlzWb58OY0aNeLUqVMMHz6crl27lkc2ERG5QRZaC3D9ogkKCqJRo0YANGrUyHeBTRERsa4qAWyajIwMtmzZgsfjIS4ujv79+5cty/V2CAsLIzMzk06dOrFjxw5q1659w2FFRKR8BKpntm/fzu7du1mxYgWXL1/m9ddfL/Mxrls099xzDydPnmTu3LlERERQt27dGworIiLlJ1CXoPnLX/5CmzZtGDZsGPn5+T/4oLs/rlo0q1evZs2aNRw6dIhWrVoBX3+GpLi4+MYTi4hIufC3Z9xuN26327cdExNDTEyMb/vcuXPk5OSwYMECsrOz+e1vf8uf//znMl154KpF8+ijj9KlSxcyMjJITEwEvv4wY7169fw+uIiImMPfFWXfL5bvq127NhEREYSEhBAREUHVqlU5e/ZsmbrgqsubQ0JCaNasGSkpKTRt2pSmTZvSpEkTQkJC/D64iIiYI8jp8OtxPT/96U/5+OOP8Xq9nDp1isuXL5f5XP11z9GIiEjFE6jPyDz88MPs2LGDfv364fV6mTRpEkFBQWU6hopGRMSGHARuefONLAD4LhWNiIgNWelaZyoaEREbUtGIiIihKtSNz0REpOIJstCF9lU0IiI2FKgrAwSCikZExIZ0jkZsIfS+4WZHMMzl3fPNjiByUyw0oFHRiFQGugtl5eMM4OdobpaKRkTEhjSiERERQ1Wx0EkaFY2IiA1pRCMiIobS8mYRETGUhXpGRSMiYkcWujCAikZExI40dSYiIoZS0YiIiKGsUzMqGhERW7LQgEZFIyJiR7ofjYiIGEqrzkRExFBaDCAiIobS1JmIiBhKU2ciImIoK41orFR6IlfodPdtvL/oWQDat2nKptdG8v6iZ1n/6jAa1r3F5HQi1ubw81EeKt2IZuHChWzduhWn04nD4SApKYm77777ho+3dOlSBg0aFMCEAvD7wVHE/fx+Ll0uBODF5/rx+1mr2Zt1giG/6sqoJ3sxZs5ak1OKWFeQRjTmOHjwIFu2bOGNN97g9ddfZ/To0YwfP/6mjpmenh6gdPJdh7NziR29yLf967FvsDfrBABVgoIoKPSYFU2kQnA4/HuUh0o1oqlbty45OTmsWbOGBx98kLZt27JmzRpcLhfh4eEcOXIEr9dLamoqDRo0YObMmezcuROAX/ziFwwePJixY8dy/vx5zp8/T/fu3fnqq6+YMmUKU6ZMMffN2cy6zX+nRZO6vu1/5eYBEHlvOIkxD9LrP+eaFU2kQnBY6CI0la5o0tPTWbp0Ka+++irVqlUjKSkJgA4dOjB16lSWLVtGRkYGXbt2JTs7m1WrVlFcXEx8fDyRkZEAREZGkpCQAHw9daaSKR/9HunAc0OieXxEOrnn8s2OI2JpFpo5q1xFc+zYMcLCwpgxYwYAn332GUOHDqV+/fq+EunQoQNbtmyhcePGdOzYEYfDQXBwMPfeey+HDh0CIDw83LT3UFnF/qwT//mrrkQ/9TLn8i6ZHUfE8pwWGtFUqnM0Bw4cYMqUKRQWfn2COTw8nFtuuYWgoCA+//xzAHbt2kXr1q1p1aqVb9rM4/Gwe/dubrvtNuDKZYNer7ec30Xl43Q6mPNcP8JqVGPlnKd4f9GzTEj8mdmxRCxN52hM8sgjj3Do0CH69+9P9erV8Xq9PPfcc7z55pu8/fbbLF68mNDQUP7whz9Qp04d
Pv30U2JiYvB4PPTu3Zu77rrrB8ds1aoVo0eP5sUXXzThHdnb8ZNn6T54DgBNHxpjchqRiiXQl6A5c+YMTzzxBK+//jqtWrUq03MdXv1JjsvlYsqUKWX+4VV2ofcNNzuCYS7vnm92BJGbsvnLXL/263ln/evu4/F4GDlyJAcPHiQtLa3Mvysr1dSZiEhl4fDzf/6YNWsWsbGxNGzY8IayVKqps6vJzMw0O4KISED5O3Pmdrtxu92+7ZiYGGJiYnzba9eupW7dunTr1o2FCxfeWBZNncmN0tSZiHX974Gzfu330B11r/n9gQMH4nA4cDgc7N+/n5YtW5Kenk6DBg38zqIRjYiIDTkDtBZg2bJlvn/+9nx2WUoGVDQiIrakG5+JiIihjKiZGz2fraIREbEhjWhERMRQ1qkZFY2IiD1ZqGlUNCIiNqSpMxERMZR1akZFIyJiTxZqGhWNiIgN6Q6bIiJiKAudolHRiIjYkYV6RkUjImJHDgsNaVQ0IiI2ZKGeUdGUl4fmbjU7QsB1HhzP9jeXmx1D/FBQbHYC41TTb7EfZaGeUdHIzdF9W0QsykJNo6IREbEhLW8WERFD6RyNiIgYSkUjIiKG0tSZiIgYSiMaERExlIV6RkUjImJLFmoaFY2IiA3pxmciImIo69SMikZExJ4s1DQqGhERG9LyZhERMZSFTtGoaERE7MhCPaOiERGxI934TEREDGWhnlHRiIjYkYV6BqfZASQwgpwOJvS+nfkD7uaV/nfTok6o2ZGkktq7dw9DElxmxxCHn49yUGlHNDNnzmTfvn2cPn2agoICmjdvTp06dXjllVeu+9wePXqwceNGqlatWg5J/RPZsjZBTgfDV33OT1vUYsh/tGDyuwfMjiWVzBuvLeKdDesJDdUfOmbT8mYLGDt2LABr167l8OHDjB492uREN+ef5woIcn79f60aIUGUlHrNjiSVUPPmLXjp5Xkkj33O7CiVns7RWJDH42Hy5MkcO3aM0tJSRo4cSefOnfnggw+YP38+AO3ateP5558HYMqUKWRnZwMwf/58atWqZVp2gMueEhrXrMqSwfdRK7QK4/70pal5pHKKeiSaEyeyzY4hgDNARePxeBg/fjwnTpygqKiI3/72t/Ts2bNsWQITpeJbvXo1derUYdmyZaSlpTF16lSKi4tJSUlh4cKFvPXWWzRq1Ih//etfAPzqV78iMzOTpk2b8sknn5icHvp3uJUdx87jenM3Q5buYdwjrQkJstCfNCJSzgJzkmb9+vXUrl2b5cuXs2jRIlJSUsqcRCOab2RlZbFz50727t0LQHFxMWfOnKFmzZrUq1cPgOHDh/v2v/vuuwGoX78+BQUF5R/4ey4UFPumyy4UFFMlyPHN1Vs1hSZSGQVq6qx3795ER0f7toOCgsp8DBXNNyIiImjcuDGJiYkUFBSQnp5OgwYNyMvL4/z589SuXZsXXniBX/7yl4C1PgwFsGZ3Ds/1as0r/e+mSpCDRZ8cp6C41OxYImISf39Dud1u3G63bzsmJoaYmBjfdo0aNQDIz89nxIgRjBw5ssxZVDTfiI2NZcKECQwaNIj8/Hzi4+NxOp1MnjyZp59+GqfTSbt27bjnnnvMjvqjLntKef69LLNjiNC0aTOWrlhldoxKz9+/hb9fLD/m5MmTDBs2jPj4ePr27Vv2LF6vV3Mr5eChuVvNjmCI/x35H2ZHED8UFJudwDjV9Ofyj/pXnsev/RrXDL7m93Nzc3G5XEyaNIkuXbrcUBYtBhARsaFAfV5zwYIF5OXlkZaWhsvlwuVylfm8tEY05UQjGjGTRjSVz78v+DeiaXjLtUc0gaB/RSIiNqQrA4iIiLGs0zMqGhERO7JQz6hoRETsyGmhz/qpaEREbMhCPaPlzSIiYiyNaEREbMhKIxoVjYiIDWl5s4iIGEojGhERMZSKRkREDKWpMxERMZRGNCIiYigL9YyKRkTElizUNCoaEREbstIlaHQ/GhERMZQuQSMi
IoZS0YiIiKFUNCIiYigVjYiIGEpFIyIihlLRiIiIoVQ0NlFaWsqkSZOIiYnB5XJx7NgxsyMF1J49e3C5XGbHCAiPx8N//dd/ER8fT79+/di8ebPZkQKipKSEcePGERsby8CBAzl+/LjZkQLmzJkzdO/enUOHDpkdpUJS0djEpk2bKCoqwu12M2rUKGbOnGl2pIBZtGgREyZMoLCw0OwoAbF+/Xpq167N8uXLWbRoESkpKWZHCogPPvgAgJUrVzJixAhmzJhhcqLA8Hg8TJo0iWrVqpkdpcJS0djEzp076datGwA/+clP+Pzzz01OFDgtWrRg3rx5ZscImN69e/Pss8/6toOCgkxMEzhRUVG+0szJyaF+/fomJwqMWbNmERsbS8OGDc2OUmGpaGwiPz+fsLAw33ZQUBDFxcUmJgqc6OhoqlSxz9WSatSoQVhYGPn5+YwYMYKRI0eaHSlgqlSpwpgxY0hJSSE6OtrsODdt7dq11K1b1/dHnNwYFY1NhIWFcfHiRd92aWmprX45283Jkyf59a9/zaOPPkrfvn3NjhNQs2bN4v3332fixIlcunTJ7Dg35a233mLr1q24XC7279/PmDFjOH36tNmxKhz9JrKJDh068MEHH/Czn/2Mv//977Rp08bsSHIVubm5/OY3v2HSpEl06dLF7DgBs27dOk6dOsXTTz9NaGgoDoejwk8LLlu2zPfPLpeLKVOm0KBBAxMTVUwqGpvo1asXn3zyCbGxsXi9XqZPn252JLmKBQsWkJeXR1paGmlpacDXCx4q+snmRx55hHHjxjFw4ECKi4sZP348VatWNTuWWICu3iwiIobSORoRETGUikZERAylohEREUOpaERExFAqGhERMZSKRqScJSUlsX37dj766CPcbvdV93O73Xg8Hr+OuWLFCltdpkfsRZ+jETHJgw8+eM3vZ2Rk8Nhjj5VTGhHjqGhEymDt2rVs3ryZ/Px8zp07x7Bhw5g3bx4tW7YkJCSE559/nuTkZM6dOwfAhAkTuOOOO1i2bBmrV6+mQYMGnDlzxnesw4cPM3r0aNLS0ti0aRMlJSXExcURFBTE6dOnSUpKIi0tjTlz5rBjxw68Xi8JCQn06dOHv/3tb0yfPp1atWrhdDr5yTMJVpoAAAJDSURBVE9+YuaPRuSqVDQiZXTp0iXeeOMNzp49S//+/SkpKeGZZ56hXbt2zJ49m8jISOLj4zl69Cjjxo1j4cKFLFmyhA0bNuBwOHjiiSeuON4XX3zBRx99xOrVqykqKmLOnDkkJyeTnp5OamoqH374IdnZ2axcuZLCwkIGDBhA165dmTFjBnPmzCE8PJzJkyeb9NMQuT4VjUgZderUCafTSf369alZsyaHDh0iPDwcgKysLLZt28bGjRsByMvL4/Dhw7Ru3ZqQkBAA2rdvf8Xxjhw5Qvv27QkKCiI0NJQJEyZc8f2srCz27dvnu/FbcXExOTk5nDp1yve6HTp0sNWNxsRetBhApIz27dsHfH1xzPz8fOrVq4fT+fV/ShERESQkJJCZmcncuXPp27cvzZs35+DBgxQUFFBSUsL+/fuvOF5ERARffPEFpaWleDwennzySYqKinA4HJSWlhIREUHnzp3JzMzkzTffpE+fPjRr1owGDRr47vj42Wefle8PQaQMNKIRKaPc3FwGDx7MhQsXmDx5MlOmTPF9LzExkeTkZFatWkV+fj7Dhw+nbt26PPvss8TGxlK3bl1CQ0OvOF7btm3p1q0bcXFxlJaWEhcXR0hICB07dmTo0KEsWbKETz/9lPj4eC5dukRUVBRhYWHMnj2bMWPGUKNGDWrUqEGtWrXK+Sch4h9dVFOkDL57Al9E/KOpMxERMZRGNCIiYiiNaERExFAqGhERMZSKRkREDKWiERERQ6loRETEUCoaEREx1P8Bf4ex8TDcC5EAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
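   ],
   "source": [
    "# Pair each test document's labeled category with its most likely predicted topic\n",
    "predictions = test_docs.topic.to_frame('topic').assign(\n",
    "    predicted=topic_probabilities.idxmax(axis=1).values)\n",
    "# Count documents per (category, predicted topic) pair and plot the counts\n",
    "heatmap_data = predictions.groupby('topic').predicted.value_counts().unstack()\n",
    "sns.heatmap(heatmap_data, annot=True, cmap='Blues');"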
   ]
  },
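  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The heatmap compares labeled categories with topic indices, but the indices themselves are arbitrary. As a rough sketch of how one might summarize the alignment in a single number, the code below maps each predicted topic to its most frequent category and reports the share of matching documents. The `toy` frame and its category labels are made-up stand-ins for `predictions`, not values from this notebook.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for the `predictions` frame (values are invented)\n",
    "toy = pd.DataFrame({\n",
    "    'topic': ['Sport', 'Sport', 'Tech', 'Tech', 'Politics', 'Politics'],\n",
    "    'predicted': [4, 4, 2, 0, 0, 0],\n",
    "})\n",
    "\n",
    "# Cross-tabulate labeled categories against predicted topic ids\n",
    "counts = pd.crosstab(toy['topic'], toy['predicted'])\n",
    "\n",
    "# Assign each topic id the category that occurs most often under it\n",
    "topic_to_label = counts.idxmax(axis=0)\n",
    "\n",
    "# Share of documents whose mapped topic matches the labeled category\n",
    "accuracy = (toy['predicted'].map(topic_to_label) == toy['topic']).mean()\n",
    "print(topic_to_label.to_dict(), round(accuracy, 2))\n",
    "```\n",
    "\n",
    "Replacing `toy` with `predictions` would apply the same mapping to the test set. Because several topics can map to the same category, this score may overstate the agreement and is best read alongside the heatmap."
   ]
  },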
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Resources\n",
    "\n",
    "- pyLDAvis:\n",
    "    - [Talk by the pyLDAvis author](https://speakerdeck.com/bmabey/visualizing-topic-models) and [paper by the original LDAvis authors](http://www.aclweb.org/anthology/W14-3110)\n",
    "    - [Documentation](http://pyldavis.readthedocs.io/en/latest/index.html)\n",
    "- LDA:\n",
    "    - [David Blei Homepage @ Columbia](http://www.cs.columbia.edu/~blei/)\n",
    "    - [Introductory paper](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf) and [more technical review paper](http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf)\n",
    "    - [Blei Lab @ GitHub](https://github.com/Blei-Lab)\n",
    "- Topic Coherence:\n",
    "    - [Exploring Topic Coherence over many models and many topics](https://www.aclweb.org/anthology/D/D12/D12-1087.pdf)\n",
    "    - [Paper on various methods](http://www.aclweb.org/anthology/N10-1012)\n",
    "    - [Blog post: overview](http://qpleple.com/topic-coherence-to-evaluate-topic-models/)\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "name": "_merged",
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "203.153px",
    "left": "69.9915px",
    "right": "1064px",
    "top": "66.3352px",
    "width": "302px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
